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I. PRELIMINARIES 



A. INTRODUCTION 

With the advent of modern computers, the implementation of numerical methods 
for the solution of systems of ordinary differential equations has become commonplace. 
Rather than developing new methods, the task at hand has become making current 
methods more efficient by reducing the amount of computer time needed to solve a 
system or by increasing the accuracy of the results. Previously, efforts to accomplish 
this task were centered on modifying existing algorithms. 

A recent advance in computer architecture, the advent of parallel processing, has 
made it theoretically feasible to reduce the time required to integrate a system of n 
differential equations by a factor of n, assuming the parallel processor computer 
possesses n or more processors. This is a very significant time savings compared to 
those previously realized. This savings is provided by the parallel processor's ability to 
perform n different tasks, e.g. integration of n differential equations, simultaneously 
rather than sequentially as is done with a computer possessing a single processor. 

In this thesis, one established method for solving systems of differential 
equations, the Runge-Kutta-Fehlberg method, is adapted for parallel processing on an 
Intel Scientific Computer iPSC Concurrent Supercomputer. The algorithm is then 
evaluated using a standardized collection of systems of equations. It is found, 
however, that this type of parallel processor is not suited for this purpose due to the 
communications overhead required. As background, short developments of ordinary 
differential equations and numerical methods are presented. 

B. ORDINARY DIFFERENTIAL EQUATIONS 

An equation that involves derivatives or differentials of a function or functions is 
called a differential equation. Differential equations are further classified as ordinary 
or partial differential equations. If the equation is a function of ordinary derivatives, it 
is an ordinary differential equation (ODE). If it is a function of partial derivatives, it is 
a partial differential equation (PDE). The order of a differential equation is the value 
of the highest derivative which appears in the differential equation. For the purposes 
of this thesis, the only concern is that of ordinary differential equations. 
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A first-order ordinary differential equation is the simplest case. Consider 



y' = f(x.y). 



( 1 . 1 ) 



This equation is a first-order ordinary differential equation. The general solution of 1.1 
is a one-parameter family of curves. To select one member from this family, it is 
necessary to specify an initial value. That is to say the initial value of the dependent 
variable is specified for any value of the independent variable. An example of this 
would be 



where the y- (i = 1, 2. .... n) are functions of the independent variable x and f- (i = 
1, 2, . . . , n) are functions of the x, Vj, . . . , y . The solution to 1.3 will be a family of 
ordered n-tuples of the form 



Again, to select one member of this family of n-tuples, it is necessary to have an initial 
value. In this case, the initial value problem becomes a vector equation 



y' = f(x,y), y(x 0 ) = y Q . 



( 1 . 2 ) 



A system of n first-order equations is of the form 



>v = f i( x « yj. >2’ • • • - y n > 
>y = fyx >h >2’ — y n > 



(1-3) 





(1-4) 



y' = f(x,y), y(x 0 ) = c 0 . 



(1.5) 



An nth-order ordinary differential equation of the form 




( 1 . 6 ) 
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is solved by converting it into a set of n simultaneous first-order differential equations 
of the form 

U 1 = u 2 = f l( x ’ u l* • 

u 2 = u 3 = f 2^ x > u l- • 



u n- 1 = u n = f n-l( x ’ u l> • • • - V 

V = §' x ’ u l> • • • ’ V 

by letting Uj = y , U 2 = y^^, . . . , u n = y( n '^); by differentiating each of these 
equations: and by substituting for y, y^\ . . . , y( n ‘^ in terms of Uj, U 2 , • • • , u n . In 
order to determine a unique solution to this set of simultaneous equations, initial 
conditions must again be specified. These initial conditions are of the form Uj(c) = 

dj, u->(c) = d-> u n (c) = *^n are obtained from transforming the conditions 

in terms of y and its derivatives. For further discussion of the solution of ordinary 
differential equations, see [Ref. 1]. 

Given a system of differential equations, the problem now becomes one of 
solving the system. Whenever possible, it is desirable to find an explicit solution. 
However, most systems cannot be solved exactly— that is, it is impossible to obtain a 
solution in elementary form. It is because of this characteristic of ordinary differential 
equations that we must turn to numerical methods to obtain the solutions. 



’ V 
• V 



(1.7) 
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II. DEVELOPMENT OF NUMERICAL METHODS 



A. TAYLOR SERIES METHOD 

Although the Taylor series methods are not usually used in practical problems, 
these methods must, however, be understood in order to understand the Runge-Kutta 
methods described in the next sections. In order to solve the initial value problem 
posed in Equation 1.2, we develop the relation between v and x by determining the 
coefficients of the Taylor series in which we expand y about the point x = x Q . If y(x) 
has m+1 continuous derivatives on an interval I containing x n , then by Taylor's 
Formula with remainder, 

y(x) = y(x n ) + y'(x n )(x-x n ) + (l/2!)y"(x n )(x-x n ) 2 4- . . . (2.1) 

+ (l,m!)y( m )(x n )(x-x n ) m + (1 (m+ l)!)y( m+ ^(c)(x-x n ) m+ 1 , 

for some c e (x,x n ). (Thomas and Finney [Ref. 2: pp. 663-665] show a detailed proof 
of this theorem.) If y(x) is a solution to the initial value problem 1.2 and v(x) has 
m+ 1 continuous derivatives, then 



y'(x n ) = f(x n ,y(x n )) 

y”(x n ) - l< ! >(x n ,y(x n )) -f x + f y y -t x + f y f 

y"'(x n ) = f< 2 >(x n ,y(x n )) - f M + t xy y' + (f yx + f ;y y')y' + f y y* (2.2) 

= f + 2ff J -ff 2 -f-ff+f 2 f 
xx xv 1 Hy *x y A y ’ 



where f and its partial derivatives are all evaluated at (x n , y(x n )). 

One could continue in this manner, computing any derivative of y evaluated at x n . 
y( m) (x n ), in terms of f and its partial derivatives evaluated at x n , y(x n ). It is obvious 
to see that for other than reasonably small values of m, the derivatives are usually 
bothersome to compute. For this reason, m in Equation 2.1 is chosen to be reasonably 
small. By letting h n = (x-x n ), y + j is approximated by 



n- 



= >n + f ( x n- >n) h n + y n )h n 2 + . . . 

+ (l/m!)f^ m * 1 >(x n , y n )h n m . 



(2.3) 
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Equation 2.3 represents a single step numerical approximation to the solution of the 
initial value problem 1.2 and is known as the Taylor series method of order m. 1 From 
Equation 2.1, the error of this method is represented by 

E n = (1 (m+ l)l)K m te, y(s))h n m+ l , (2.4) 

where £, e (x n , x n+1 ). 

As seen above, the computational disadvantages of the Taylor series method are 
due to the calculation and evaluation of the derivatives . . . , at (x_, 

y n ). It will be seen in the following sections that a Runge-Kutta method of order m is 
usually as accurate as the Taylor series of order m and is simpler to use. The Taylor 
series method, however, will be shown to be of theoretical value since the order of a 
Runge-Kutta method will be defined using the Taylor series method. 

B. RUNGE-KUTTA METHOD 

The German mathemetician Carl Runge (1856-1927) was the first to develop a 
numerical integration technique designed to approximate the Taylor series method 
without requiring explicit evaluations of derivatives beyond the first while preserving 
the accuracy of the Taylor series method [Ref. 3]. The technique was later improved 
by the German mathematician Martin Kutta (1867-1944) [Ref. 4] and, thus, it was 
named the Runge-Kutta method. 

This technique sets up a problem with undetermined parameters and uses 
evaluations of f(x,y) within the interval (x n , y ) and (x n + i> Y n + j) thus bypassing the 
derivatives of the Taylor series by requiring ffx, y) to be evaluated a number of 
additional times within the interval. The general form of this scheme is 



>'n+ 1 



- n 



v 

+ £ wjkj 

i= 1 



(2.5) 



where v is the number of f(x, y) substitutions, w- are the weighting coefficients, and the 
k- satisfy the sequence 



Published by the English mathematician Brook Taylor (1685-1731) in 1715, 
however, Gregory and Leibnitz knew the series before Taylor, and John Bernoulli had 
published a similar result in 1694 [Ref. 1: p. 106]. 



(2.6) 



k l = hf * x m >'n) 

k 2 = hf(x n + c 2 h, y n + a^k^ 

k 3 = hf(x n + c 3 h,y n + a 31 kj + a 32 k 2 ) 



The values of k can be thought of as estimates of the change in y when x changes a 
value of h. The problem now becomes one of determining the coefficients wj, c-, and 
a-j. Each set of parameters will specify the points (x, y) where f(x. y) is to be 
evaluated. Therefore, this method calculates y n+ j using only a value for y and 
evaluations of f(x, y) at points between x n and x n _|_j. For this reason the method is 
termed self-starting. To obtain specific values for the coefficients, a value for v is 
chosen and y + j is expanded in powers of h such that it agrees as well as possible 
with the solution of the ordinary differential equation found using the Taylor series 
method. (For a complete development see [Ref. 5].) 

A popular example of this method is the classical fourth-order Runge-Kutta 
method. It is given by 

y n - 1 = y n + (l'6)(k 0 + 2k j + 2k 2 + k 3 ), (2.7) 

where k Q = hf(x n , y n ) 

kj = hl(x n + h/2, y n + k 0 /2) 
k 2 = hf(x n + h/2, y n + k r '2) 
k 3 = hf(x n A h, y n + k 2 ) 

In this case, we avoid the derivatives of the Taylor series by performing four function 
evaluations on f(x, y) in the interval (x n , y n ) and (x n+ j, y + |). As was stated earlier, 
the Taylor series method provides an error estimate for other methods. Here, as h goes 
to 0. this method agrees asymptotically with the Taylor series through the Iff term, 
thus, making it a fourth-order method with error term proportional to from 
Equation 2.4. A disadvantage of this method is that an estimate of the local error is 
not readily available to help in choosing a suitable stepsize h. 
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C. RUNGE-KUTTA-FEHLBERG METHOD 

In 1969, Erwin Fehlberg published a variation of the Runge-Kutta method which 
uses an estimate oflocal error to select a proper stepsize [Ref. 6]. For a given value of 
y n , Fehlberg's method computes two estimates of y n+ j using fourth- and fifth-order 
Runge-Kutta formulas. In order to obtain an estimate oflocal error, the two values of 
y n+ j are compared. The stepsize is then adjusted, depending on the local error. 
Fehlberg first uses 

6 

>'n+l = >n + I c i k i ( 2 - 8 > 

i= 1 

where the k- satisfy 



i-1 

k i = h n^ x n + a i h n’ >'n + X Pij k i> 1 k • • • * 6 - ( 2 - 9 ) 

j=l 

This method requires six function evaluations per step. As with the fourth-order 
method discussed in Section II. B. the c- are found by expanding y + ^ in powers of h n 
so that it agrees as well as possible with the Taylor series solution. The coefficients 
determined by Fehlberg are found in Appendix A. Fehlberg found that the two 
expansions match until the h n ^ term, thus, making the method fifth-order. This is a 
departure from the behavior of the nth-order Runge-Kutta methods where n = 1, 2, 3, 
4 which produce (n+ l)st order error. This partially explains the popularity of the 
classical fourth-order method described in the previous section; it takes two more 
function evaluations to obtain one more order of accuracy. Fehlberg's method, 
however, exploits the sixth function evaluation by determining a second value y n+ j* 
using 



>n+ f 






+ I c f k i 

i= 1 



( 2 . 10 ) 
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This value was found to be fourth-order using the method described above. The local 
error is then estimated by 

6 

D n = I ( c i - c f)ki (2-11) 

i= 1 



which is used for stepwise control. [Ref. 7: pp. 129-131] 

Because the Runge-Kutta-Fehlberg method is generally thought of as one of the 
"best methods" available for solving nonstiff systems of equations [Ref. 8], it was 
chosen to adapt for parallel processing. 
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III. DESCRIPTION OF SEQUENTIAL PROGRAM 



A FORTRAN program written by FLA. Watts and L.F. Shampine [Ref. 7: pp. 
132-147] was chosen to be implemented. The program solves initial value problems in 
ordinary differential equations and is based on the Runge-Kutta-Fehlberg method 
described in Section II.C. It is designed to solve non-stiff and mildly stiff systems of 
differential equations when derivative evaluations are inexpensive. The program is 
typically used to integrate from a given initial value to a desired final value but can 
also be used as a one-step integrator. The program consists of a main program along 
with subroutines RKF45, RKFS, and FEHL. The following is a brief description of 
the program as it was developed for sequential processing. 

A. MAIN PROGRAM 

The main program first defines the system of equations to be solved through the 
use of the problem specific subroutine F. Additionally, it defines the system's initial 
conditions and program parameters such as absolute and relative error tolerances, and 
it provides output of data and error messages. Once the system of equations and 
parameters are defined, the main program begins solution of the problem by passing 
information to subroutine RKF45. 

B. SUBROUTINE RKF45 

Once the main program sets up the problem, subroutine RKF45 becomes the 
interfacing routine for the solution of the problem. RKF45 first sets up work arrays 
for storage of information used during integration, thus relieving the user of lengthy 
subroutine calling lists later in the program. It then calls subroutine RKFS, providing 
it with the work arrays. 

C. SUBROUTINES RKFS AND FEHL 

RKFS is the subroutine which, along with subroutine FEHL, performs the 
integration of the system of equations. It first establishes a minimum acceptable 
relative error and a maximum number of function evaluations allowed in order to avoid 
the expense of a user's attempt to obtain an excessive accuracy. It next checks input 
parameters, issuing error flags back to RKF45 as appropriate. Machine epsilon is then 
computed and used in conjunction with the minimum acceptable relative error to limit 
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precision difficulties. Error flags are issued if user specified relative error tolerance is 
too small. Once these preliminary tasks are complete, initialization is performed. This 
includes setting the function evaluation counter to zero and estimating the initial 
integration stepsize H. Throughout the program, the stepsize is not allowed to become 
smaller than 26 units of roundoff in the dependent variable T. 

Once stepsize is computed and checked, subroutine FEHL is called. Subroutine 
FEHL contains the heart of the integrator in the form of the FORTRAN equivalent of 
the Runge-Kutta-Fehlberg formulas represented in Equations 2.8-2.10. Because of its 
importance, subroutine FEF1L is included as Appendix B. FEHL performs the 
integration and returns to RKFS which in turn implements Equation 2.11 in order to 
determine local error and test to see if the integration step was successful. If 
unsuccessful, the stepsize is reduced and integration is attempted again. If successful, 
the solution at T + H is stored and the components of the system of equations are 
reevaluated at T+ H using subroutine F. The function evaluation counter is changed 
to reflect the function evaluations performed in FEHL. This integration process 
continues until the final location is reached causing program flow to return to the main 
program which provides output of the final solution. 
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IV. IMPLEMENTATION 



A. METHODOLOGY 

In order to test the hypothesis that the time required to integrate a system of n 
ordinary differential equations can be reduced by a factor of n through parallel 
processing, the program described in Chapter III was implemented on an Intel 
Scientific Computer iPSC Concurrent Supercomputer. Appendix C contains a 
technical description of this computer. The particular computer used in this thesis 
possesses 16 processors thus making it a 16-node or 4-dimensional hypercube, as 
explained in the appendix. 

The general scheme of the testing of this hypothesis was to choose systems to be 
integrated from a standard suite of problems used to test the performance of other 
integrator programs [Ref. 9: pp. 617-621], Four systems were chosen in order to test 
performance of both small and moderate sized systems. As was done in the reference, 
the interval of integration for all implementations was [0, 20]. The constraint of only 
examining small and moderate sized systems was imparted due to the number of 
available independent processors being 16 or less. The systems chosen consist of 2 
equation, 3 equation, 4 equation, and 10 equation systems. Each problem was first 
solved sequentially and timed on a single processor of the hypercube by adapting the 
code previously described, thus providing a sequential time standard to be compared 
with parallel run times. Next, the Runge-Kutta-Fehlberg algorithm was adapted for 
parallel processing using varying schemes to optimize performance of the hypercube. 
The systems were then solved using these schemes and timed. Detailed descriptions of 
each system's implementation are contained in this chapter following a discussion of 
internodal communication times. 

B. INTERNODAL COMMUNICATION TIMES 

The theoretical feasibility of reducing integration time by a factor of n assumes 
that the time required to pass information between processors is minimal when 
compared to the speed of computation. Time spent in communicating is a critical 
factor in the implementation of an integrator of systems of ordinary differential 
equations since, after each integration step, the solutions to each component of the 
system must be combined at a central location. This requires, for every step in the 
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integration, a message pass back to a central node from each node tasked with 
processing a separate component of the system of equations. For this reason, the 
hypercube's internodal communication times were empirically determined for later use 
in minimizing total time spent in communication when implementing the parallel 
algorithms. 

Based on the topology of the 4-dimensional hypercube, as is discussed in 
Appendix C, it was decided to determine communication time for a message sent round 
trip from the host to the cube as well as between two nodes of distance 1, 2, 3, and 4 
from each other. For these timings, a message of length 4 bytes was sent round trip 
1000 times and an average round trip time was calculated. This experiment was 
performed a total of ten times for each and a final average round trip time was found. 
From the host to the cube, average round trip time was .0230S seconds. Average 
round trip times between nodes whose internodal distances are 1, 2. 3, and 4 were 
.002709 seconds. .005317 seconds, .006615 seconds, and .007797 seconds respectively. 
In addition, message passing was also timed for messages of length 16, 32, 40, and 80 
bytes. These times were similar to those found using the 4 byte long message, thus 
showing that internodal communication times are not a function of message length. 

Three conclusions may be drawn from these results. First, when minimization of 
communications time is desired, the host should be used only to house the main 
program and provide input and output. It should not be used as a "seventeenth" node 
due to the high relative order of host to cube communication time as compared with 
internodal communication times. Secondly, to minimize the total communication time 
in a given parallel algorithm, internodal distances must be considered when assigning 
tasks to specific nodes. Thus, in order to attain minimum program run times, parallel 
algorithms must create an optimum hypercube topology for a given system of 
equations based on internodal distances. Thirdly, since communication time is not 
message length dependent, it can also be concluded that a single long message is 
preferred to several short messages containing the same information. 

C. A TWO EQUATION SYSTEM 

1. System Description 

Equation 4.1 depicts the two equation system chosen. It represents the 
growth of two conflicting populations. 

>Y = 2(vj - yjyj), y !<0) = 1, (4.1) 

>' 2 ' = ~ (>'2 ~ y ^ A )- >’ 2 (0 ) = 3 - 
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2. Sequential Implementation 

Sequential implementation was accomplished by adapting the integrator 
program described in Chapter III to run at node 0. This adaptation includes a 
subroutine F tailored to Equation 4.1. A main program, running at the host, loaded 
the integrator program to node 0 and provided output of integration results and 
sequential run times. It should be noted that these run times do not include time 
necessary to load the integrator program to node 0. 

3. Parallel Implementation I 

The parallelization scheme thought to be most natural, i.e. sending component 
1 to node 1, component 2 to node 2, ... , component n to node n, was next 
implemented. This scheme was termed "shotgun" and consists of a main program at 
the host, the modified integrator program less subroutine FEHL at node 0, and node 
programs at nodes 1 and 2. Excerpts from the node 0 program and the node 1 and 
node 2 programs are listed in Appendix D. 

Again, the main program loads the node programs and outputs integration 
results and parallel run times. The node 0 program is the driver program for the 
integration. It has been modified by removing subroutine FEHL and adding 
communications with nodes 1 and 2 which evaluate components 1 and 2 of the system 
respectively. It should be noted that these node assignments were made in order to 
insure internodal distances were minimized. Referring to Figure 4.1, the reason that 
this process was termed "shotgun" was due to the fact that the node 0 program sends 
and receives information from the nodes processing the component computations in a 
"shotgun" fashion. 

A typical integration step takes place as in the sequential program except that 
instead of calling subroutine FEHL, the node 0 program first sends the component 
nodes the initial y vector. Although this message passing is sequential, the 
computation at the nodes does overlap, providing concurrent component processing. 
The component nodes perform the first function evaluation, using subroutine FNODE, 
and the first Runge-Kutta-Fehlberg step. This information is then sent back to node 0. 
This process takes place five times per integration step. Node 0 then computes a 
solution at the new location T+H, computes an appropriate stepsize, and again calls 
the component nodes to continue integration. 

In general, the "shotgun" scheme may be extended to a 16-node hvpercube to 
integrate systems up to size fifteen. For this scheme, the cost in number of message 
transmissions is represented by Equation 4.2 . 
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Figure 4.1 Shotgun Scheme. 

COST = 2 x NEQN x NFE (4.2) 

where NEQN is the size of the system and NFE is the number of function evaluations. 
Since the number of function evaluations increases as the error tolerances are 
decreased, integration times can be expected to increase due both to the increased 
number of computations required, as well as the increased communications overhead 
requirement. For this system of two equations, the communications overhead in terms 
of the number of message passes performed is four times the number of function 
evaluations. 

4. Parallel Implementation II 

In order to reduce the communications overhead present in the "shotgun" 
implementation, a second integration scheme was developed. This new scheme, termed 
"flip-flop", involves sending information from node 0 to nodes 1 and 3 and then 
performing the integration step through a series of computations at these two nodes 
and message passes between them. Figure 4.2 depicts this scheme and program 
excerpts are contained in Appendix E. 

Again the program at the host provides loading of the three node programs 
and output of the results. The node 0 program is the driver for the integration. The 
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Figure 4.2 Flip-flop Scheme. 

integrator segment of the original sequential code was again modified. It now only 
sends and receives one message from nodes 1 and 3 for a total of four message passes 
per integration step. Node 1 computes function evaluations for the first component, 
node 3 for the second. First, node 0 sends the initial information to each component 
node and these nodes compute the first function evaluation for the respective 
component. Once completed, the two component nodes exchange information in a 
"flip-flop" manner and then proceed with their respective second function evaluation. 
This process continues until all of the function evaluations in the Runge-Kutta- 
Fchlberg scheme are completed at the component nodes, at which time nodes 1 and 3 
transmit their final component solutions back to node 0. The node 0 integrator 
program proceeds by advancing the integration as was done in previous programs. 

In terms of communication overhead, the "flip-flop" scheme is superior to the 
"shotgun" scheme. The cost, in terms of message transmissions, is represented by 
Equation 4.3 . 



COST = 2.S x NFE 



(4.3) 



where XFE is the number of function evaluations. The cost is derived from the fact 
that fourteen message transmissions occur during the computation of five function 
evaluations. These include ten between nodes 1 and 3, two between node 0 and node 
1, and two between node 0 and node 3, as shown in Figure 4.2 . This cost is thirty 
percent of the cost of the "shotgun" algorithm for the two equation system. 

D. A THREE EQUATION SYSTEM 

1. System Description 

The three equation system chosen is depicted in Equation 4.4 and represents a 
linear chemical reaction. 



= — + y 2 > 


Yl(0) = 2, 




= >'l + 2 >’2 + >> 


y 2 (0) = 0, 


(4.4) 


>'3 = >2 ~ >’3’ 


y 3 (0) = 1. 





2. Sequential Implementation 

Sequential implementation was performed in the same manner described for 
the two equation system and by modifying the subroutine F to reflect Equation 4.4 . 

3. Parallel Implementation I 

The three equation system was first run parallel using the "shotgun" 
integration scheme with node 0 for the integrator program and nodes 1, 2, and 4 for 
the component programs. Nodes 1, 2. and 4 were chosen to again minimize internodal 
distances. Program excerpts are not included for this implementation as they are minor 
modifications of those found in Appendix D. From Equation 4.2, it can be determined 
that the the cost of solving the three equation system by the "shotgun" method is equal 
to six times the number of function evaluations. 

4. Parallel Implementation II 

An additional scheme was developed to decrease the total integration time for 
the three equation system. It was termed the "train" scheme due to its use of messages 
in the form of a real valued vector of size seventeen which is passed from node to node 
during integration. The vector contains information necessary for integration including 
values for T, stepsize H, y values, v' values and computed derivatives. Upon the 
message's arrival at a node, the node performs function evaluations for its respective 
component, updates information in the message, and passes it on. An analogy can be 
made between the process of updating information in the message and filling cars in a 
train; thus the name "train." 
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Figure 4.3 Three Equation Train Scheme. 

Figure 4.3 depicts this integration scheme and program excerpts are listed in 
Appendix F. Node 0 is used to run the driver program while the three components of 
the system are processed at nodes 1, 2, and 3. Nodes 1 and 2 were chosen since they 
are of minimal distance to node 0, whereas, node 3 was chosen for its adjacency to 
node 1. 

Referring to Figure 4.3 , the "train" scheme will be explained in terms of 
message pass time frames, meaning the time frame during which a message is passed 
between two nodes. In the first time frame, node 0 computes the first Runge-Kutta- 
Fehlberg function evaluation and sends the information "train" to node 1 which 
performs the second function evaluation. While this computation is being performed, 
the second time frame begins when node 0 sends the integration information to node 2. 
Upon completion of its computation, node 1 updates the information pertaining to its 
component and sends the "train" to node 3. At this point nodes 2 and 3 are 
performing their computations for the first function evaluation. As a worst case, it is 
assumed that node 3 sends its updated information to node 0 during the third time 
frame and that node 2 returns its information during a fourth time frame. Throughout 
this process, as information is received at node 0, the updated values are placed in a 
buffer until all component nodes return their updates. This completes one function 
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